Conversation
src/Monitor/Tracing/Zipkin.hs
Outdated
else flush
pure $ Zipkin mgr req tracer mbEpt
zpk = Zipkin mgr req tracer mbEpt
for_ (microSeconds <$> mfilter (> 0) mbPrd) $ \delay -> fork $ forever $ do
This part is the one I'm the least satisfied with; I took it "as is" from the original PR, but I think it is flawed.
If the mbPrd parameter is Nothing, the behaviour is messed up: the flush is only performed when the process terminates (i.e., in most cases, when the program quits). I suspect (without being 100% sure) this is due to the intricacies of mfilter. Though we typically set this parameter, it would be better not to leave such aberrant behaviour possible.
Yeah, that looks really bad. What is mbPrd for? You should not be allowed to not spawn a publishing thread; let's fix this.
The documentation for the field reads:
, settingsPublishPeriod :: !(Maybe NominalDiffTime)
-- ^ If set to a positive value, traces will be flushed in the background every such period.
Which, I guess, can be interpreted as "if set to Nothing, there is no regular flush", but that makes very little sense (except maybe for very specific tasks like daemons that get restarted, or things like that?). I suggest replacing the Nothing case with a sensible default (e.g., every second).
Yeah, let's just get rid of the Maybe there.
Shall we also remove the ignore flag, since it doesn't get used anymore?
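For illustration, a minimal standalone sketch of that behaviour (startFlusher and its arguments are made up for the example, not the library's API): the background thread is always spawned, and a non-positive period falls back to one second instead of silently disabling the periodic flush.

```haskell
import Control.Concurrent (ThreadId, forkIO, threadDelay)
import Control.Monad (forever)

-- Always spawn the flushing thread; a non-positive period falls back to a
-- one-second default rather than disabling periodic flushing altogether.
startFlusher :: Double -> IO () -> IO ThreadId
startFlusher periodSeconds flush = forkIO . forever $ do
  let seconds = if periodSeconds > 0 then periodSeconds else 1
  threadDelay (round (seconds * 1000000))  -- threadDelay takes microseconds
  flush
```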
src/Monitor/Tracing/Zipkin.hs
Outdated
let unspan (ZipkinSpan _ sample) = sample
    queue = spanSamples tracer
    fillOne item = void . atomically $ do
      cannotPush <- isFullTBQueue queue
This check is very important, lest an uncaught exception be raised. Of course, as a result, if the queue is full, some traces will be lost (which is why I set a fairly high limit on the queue size).
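For reference, a minimal standalone version of that pattern (refillDropping and the IO wrapper are illustrative, not the PR's actual code): each item is pushed back only if the queue still has room, and is silently dropped otherwise, so the call never blocks or throws.

```haskell
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (TBQueue, isFullTBQueue, writeTBQueue)
import Control.Monad (unless)
import Data.Foldable (for_)

-- Push a batch back into a bounded queue, checking for room before each write;
-- items that no longer fit are dropped instead of blocking the caller.
refillDropping :: TBQueue a -> [a] -> IO ()
refillDropping queue items = for_ items $ \item ->
  atomically $ do
    full <- isFullTBQueue queue
    unless full (writeTBQueue queue item)
```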
src/Control/Monad/Trace.hs
Outdated
tags <- readTVar tagsTV
logs <- sortOn (\(t, k, _) -> (t, k)) <$> readTVar logsTV
writeTChan (tracerChannel tracer) (Sample spn tags logs start (end - start))
writeTBQueue (tracerQueue tracer) (Sample spn tags logs start (end - start))
writeTBQueue blocks when the queue is full; it should instead do nothing (see SBQueue, in particular writeSBQueue).
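For illustration, a tiny queue in that spirit (a sketch with made-up names, not log's actual SBQueue): writes never block, they simply drop the element once the capacity is reached.

```haskell
import Control.Concurrent.STM (STM, TVar, newTVar, readTVar, writeTVar)

-- A capacity-bounded queue whose write never blocks: once the capacity is
-- reached, the element is silently dropped (that sample is lost).
data DropQueue a = DropQueue !(TVar [a]) !(TVar Int) !Int

newDropQueue :: Int -> STM (DropQueue a)
newDropQueue capacity = DropQueue <$> newTVar [] <*> newTVar 0 <*> pure capacity

writeDropQueue :: DropQueue a -> a -> STM ()
writeDropQueue (DropQueue itemsTV countTV capacity) x = do
  count <- readTVar countTV
  if count >= capacity
    then pure ()  -- full: drop instead of blocking
    else do
      writeTVar countTV (count + 1)
      readTVar itemsTV >>= writeTVar itemsTV . (x :)

-- Drain everything currently queued, oldest first.
flushDropQueue :: DropQueue a -> STM [a]
flushDropQueue (DropQueue itemsTV countTV _) = do
  items <- readTVar itemsTV
  writeTVar itemsTV []
  writeTVar countTV 0
  pure (reverse items)
```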
src/Monitor/Tracing/Zipkin.hs
Outdated
publish zpk `catch` handler
pure zpk

refillQueue :: (MonadIO m, MonadBaseControl IO m) => Tracer -> [ZipkinSpan] -> m ()
For comparison, in log-elasticsearch we simply flush the queue, take all messages and try to send them to ES. If this fails with any exception, we just retry in place with the messages we got; there's no need to put these items back into the queue, unless I'm missing something.
You also then won't need the PublishFailed exception, IIUC.
I'll see how this behaves with an in-place retry.
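For illustration, a sketch of that in-place retry (publishLoop, send, and the one-second delays are assumptions, not the library's actual code): the queue is drained once per period and the same batch is retried until the backend accepts it, with nothing pushed back into the queue.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (TBQueue, flushTBQueue)
import Control.Exception (SomeException, try)
import Control.Monad (forever, unless)

-- Periodically drain the queue and publish; on failure, retry the same batch
-- in place instead of re-enqueueing the items.
publishLoop :: TBQueue a -> ([a] -> IO ()) -> IO ()
publishLoop queue send = forever $ do
  threadDelay 1000000  -- publish period (1s)
  batch <- atomically (flushTBQueue queue)
  unless (null batch) (retrying batch)
  where
    retrying batch = do
      result <- try (send batch) :: IO (Either SomeException ())
      case result of
        Right () -> pure ()
        Left _   -> threadDelay 1000000 >> retrying batch  -- back off, retry
```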
src/Control/Monad/Trace.hs
Outdated
-- | Creates a new 'Tracer'.
newTracer :: MonadIO m => m Tracer
newTracer = liftIO $ Tracer <$> newTChanIO <*> newTVarIO 0
newTracer = liftIO $ Tracer <$> newTBQueueIO 10000 <*> newTVarIO 0
10000 sounds a bit small; log uses a default of 1000000.
This graph suggests that when the publisher fails, traces are dropped for some reason 🤔
That is my understanding too, but it is not the behaviour I observe in Zipkin; the messages are properly retrieved when the server is restarted. This is why I am so puzzled by the profile; it makes no sense to me. I'll retry it once I've switched to the SBQueue you recommended.
Force-pushed from af7dd7f to 58abc4e.
src/Control/Monad/Trace.hs
Outdated
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)
import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)
import GHC.Conc (retry)
While we're at it, can you get rid of the stm-lifted dependency? It's abandoned and not really needed here; AFAIK the only thing it gives us is a lifted atomically.
Then you can import Control.Concurrent.STM, which also exports retry.
Never mind, let's do it in a different PR.
You can use Control.Monad.STM instead of GHC.Conc, though.
I can, but (a) it's already just a re-export of GHC.Conc, and (b) it requires adding the stm dependency (which is admittedly only hidden for now). I will gladly do a PR dropping stm-lifted as a follow-up to this one.
Ok, let's leave it then. I see you already did it ;)
Yes, I was explaining why I had used GHC.Conc, not refusing the requested change! If the state of the PR seems satisfactory to you, I think we can merge, and I'll do the PR dropping stm-lifted right after.
Force-pushed from 58abc4e to f38e4e6.
src/Monitor/Tracing/Zipkin.hs
Outdated
@@ -1,4 +1,5 @@
{-# LANGUAGE CPP #-}
{-# LANGUAGE ExistentialQuantification #-}
Force-pushed from f38e4e6 to a756751.
This PR mostly builds upon the open draft PR on the original tracing repo (mtth#9) to ensure that the library gracefully survives a failure to connect to the backend.
The main change is to use a reasonably sized bounded queue to store the samples to be sent to the backend, together with logic to re-enqueue messages that could not be sent.
As a result:
I tested with the following dummy code:
I did some profiling to make sure this didn't introduce any leak.
With no connection loss, we seem to have quasi-constant space:

With a connection loss, oddly enough, there is a clear variation:

I understand the spike (the queue gets filled), but I don't get why it drops way lower afterwards; at least it doesn't leak, I guess, but I'm still a bit puzzled by the result.
Edit after changes: now that I've switched to the SBQueue, the results are much closer to what I expected (though PostScript did some odd things in rendering; apologies for the lack of readability).
With no loss, we again have constant space:
When losing the backend and then plugging it back in, we observe a spike as the queue fills up (though a very long name in the key shrunk the graph, the spike followed by a return to normal is clearly observable):